What Makes Olewave's Speech Data Cleaning Pipeline Effective and Unique

 

Olewave’s state-of-the-art, highly customizable speech data cleaning pipeline integrates seamlessly with its speech data collection pipeline; together they form Olewave’s speech data curation pipeline. This end-to-end system is designed to deliver high-quality, consistent, and cost-effective speech datasets at scale, tailored to a wide range of downstream applications. We also offer large-scale, cost-effective speech datasets curated through this pipeline, available in multiple languages and covering diverse topics.


The figure below illustrates the streamlined workflow of our speech data cleaning pipeline, showcasing its ability to transform raw, unstructured speech data into polished, actionable datasets.

Our pipeline offers several advantages:

  • Effective: It produces validated speaker labels and transcriptions, accurate word timestamps, and well-calibrated (not overconfident) confidence scores, compared with tools such as Whisper and other open-source solutions.
  • Robust: It handles improvised conversational speech, including scenarios with speaker overlap and transcripts containing ASR errors.
  • Extensible: It leverages optional metadata and can be upgraded to incorporate additional modalities for increased label reliability. The pipeline also supports plug-and-play integration of label tagging models (e.g., emotion, semantics), whether Olewave’s or the client’s own.
  • Efficient: It runs quickly and is cost-effective, requiring little or no GPU resources.
Additionally, if your goal is to have human annotators refine Automatic Speech Recognition (ASR) results, the output of our pipeline can significantly reduce labeling costs by:
  • Precisely highlighting words with accurate timing information, enabling annotators to quickly locate and review mumbled or unclear segments without repeatedly listening to the entire audio.
  • Guiding annotators to focus on words with mid-range confidence scores, minimizing the need to review every word and streamlining the correction process.

By providing these targeted insights, our pipeline enhances efficiency, reduces manual effort, and ensures a more cost-effective annotation workflow.
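As a rough illustration of that saving, the sketch below estimates how much audio an annotator actually needs to listen to when only mid-confidence words (plus a little surrounding context) are reviewed. The field names, thresholds, and context padding are illustrative assumptions, not the pipeline's actual schema or policy.

```python
# Illustrative estimate of annotator review time when only words in a
# mid-range confidence band are re-checked. Thresholds and the 0.5 s of
# listening context per word are hypothetical choices for this sketch.

def review_seconds(words, low=0.3, high=0.9, context=0.5):
    """Sum the durations (plus listening context) of words whose
    confidence falls in the mid range that needs human review."""
    total = 0.0
    for w in words:
        if low <= w["confidence"] < high:
            total += (w["end"] - w["start"]) + context
    return total

# A few words from the utterance shown later in this document.
words = [
    {"word": "music",   "start": 47.18, "end": 47.53, "confidence": 0.97},
    {"word": "related", "start": 47.53, "end": 47.89, "confidence": 0.79},
    {"word": "to",      "start": 48.90, "end": 48.96, "confidence": 0.17},
    {"word": "la",      "start": 49.49, "end": 49.83, "confidence": 0.56},
]

# Only 'related' and 'la' fall in the mid-confidence band, so the
# annotator reviews about 1.7 s instead of the whole utterance.
print(round(review_seconds(words), 2))
```

Words below the low threshold (like "to" here) are candidates for automatic removal rather than manual review, which is exactly the division of labor described above.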

Below are visualized examples of the pipeline’s output, showing speaker labels, word-level timestamps, and confidence scores:

Figure 1. The conversation is from real-world interactions, not synthetic or prompted speech. The transcripts were uploaded manually by humans, not generated by ASR, and the speaker labels were likewise added manually, not derived from speaker diarization algorithms. The accompanying JSON file includes detailed transcriptions with speaker time intervals, conversation transcripts, word-level timestamps, and confidence scores. Phone-level timestamps and confidence scores are also available upon request.

Figure 2. An example of postprocessing noisy spontaneous speech, where the value beneath each word is its pronunciation score, which is closely related to speech assessment. Some of the scores are explained below: a) The word 'together' was pronounced like 'togeth-': the ending '-er' is not present, and the '-o-' vowel is very short in duration, almost missing. b) The starting word 'it' is barely audible, though there is a fricative phone on the spectrogram; its low score of 0.47 indicates the word is likely missing. c) The inserted word 'lt' at the end was never spoken: the transcript wrongly includes '&lt' (the HTML escape for the '<' symbol), and the very low score of 0.13 marks it as an inserted word, which our data cleaning pipeline can remove.

 

It is widely recognized in both academia and industry that large End-to-End (E2E) frameworks have inherent limitations in producing accurate word-level timestamps, and that large E2E ASR models likewise struggle to produce reliable word-level confidence scores.

To address these limitations, we leverage our deep expertise in non-End-to-End speech-to-text alignment techniques and have developed sophisticated procedures that incorporate metadata information. This approach allows us to effectively clean non-spoken content (e.g., filler words, background noise transcriptions) from transcripts, ensuring higher accuracy and reliability in the final output.  
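As a rough illustration of the transcript-cleaning step, the sketch below strips bracketed non-speech annotations and common filler tokens from a transcript line. The filler list and markup pattern are assumptions for this example only; the pipeline's actual rules are considerably more sophisticated.

```python
import re

# Hypothetical cleaning pass: remove bracketed non-speech annotations
# (e.g. "[noise]", "(laughs)") and common filler tokens. This is only a
# sketch of the idea, not the pipeline's actual rule set.

FILLERS = {"uh", "um", "erm", "hmm"}
NON_SPEECH = re.compile(r"[\[(][^\])]*[\])]")  # matches [noise], (laughs), ...

def clean_transcript(text: str) -> str:
    text = NON_SPEECH.sub(" ", text)
    kept = [w for w in text.split() if w.lower().strip(",.") not in FILLERS]
    return " ".join(kept)

print(clean_transcript("um [noise] it goes together, uh, you know"))
# -> "it goes together, you know"
```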

Here is a snippet of the word-level JSON output of the data pipeline. The ‘speaker’ field is the ID of the speaker in the recording. The ‘start_time’ and ‘end_time’ fields, in seconds, give the time span during which that speaker is talking. The ‘block’ field contains one or more consecutive utterances from the same speaker; each utterance has its own start time, end time, transcript, speaker overlap indicator, and confidence scores. When an ‘overlap’ field is true, the current segment has a speaker overlap with adjacent segments. The confidence score ranges from 0 to 1.

 

{
    "speaker": 0,
    "start_time": "47.18",
    "end_time": "49.83",
    "block": [
        {
            "start_time": "47.18",
            "end_time": "49.83",
            "transcript": "music-related jobs compared to New York and LA.",
            "overlap": false,
            "word_details": [
                {"start_time": "47.18", "end_time": "47.53", "word": "music", "confidence": "0.97"},
                {"start_time": "47.53", "end_time": "47.89", "word": "related", "confidence": "0.79"},
                {"start_time": "47.89", "end_time": "48.44", "word": "jobs", "confidence": "0.98"},
                {"start_time": "48.44", "end_time": "48.90", "word": "compared", "confidence": "0.96"},
                {"start_time": "48.90", "end_time": "48.96", "word": "to", "confidence": "0.17"},
                {"start_time": "48.96", "end_time": "49.08", "word": "new", "confidence": "0.97"},
                {"start_time": "49.08", "end_time": "49.37", "word": "york", "confidence": "0.98"},
                {"start_time": "49.37", "end_time": "49.49", "word": "and", "confidence": "0.92"},
                {"start_time": "49.49", "end_time": "49.83", "word": "la", "confidence": "0.56"}
            ]
        }
    ]
}
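A consumer of this output might load a segment and surface the words worth a second listen. The sketch below parses a trimmed-down segment in the format shown above (note that the numeric fields are serialized as strings); the confidence thresholds are illustrative assumptions, not part of the format.

```python
import json

# A trimmed-down segment in the word-level output format shown above.
segment_json = """
{
  "speaker": 0,
  "start_time": "47.18",
  "end_time": "49.83",
  "block": [
    {
      "start_time": "47.18",
      "end_time": "49.83",
      "transcript": "music-related jobs compared to New York and LA.",
      "overlap": false,
      "word_details": [
        {"start_time": "48.90", "end_time": "48.96", "word": "to", "confidence": "0.17"},
        {"start_time": "49.49", "end_time": "49.83", "word": "la", "confidence": "0.56"}
      ]
    }
  ]
}
"""

def words_to_review(segment, low=0.3, high=0.9):
    """Yield (word, start, confidence) for words in the mid-confidence
    band that merits human review. Numeric fields arrive as strings."""
    for utt in segment["block"]:
        for wd in utt["word_details"]:
            conf = float(wd["confidence"])
            if low <= conf < high:
                yield wd["word"], float(wd["start_time"]), conf

segment = json.loads(segment_json)
for word, start, conf in words_to_review(segment):
    print(f"{start:6.2f}s  {word!r}  confidence={conf:.2f}")
```

Here only "la" (0.56) is queued for review; "to" (0.17) falls below the low threshold and is instead a candidate for automatic removal.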